Parametric Models of Linguistic Count Data

نویسنده

  • Martin Jansche
چکیده

It is well known that occurrence counts of words in documents are often modeled poorly by standard distributions like the binomial or Poisson. Observed counts vary more than simple models predict, prompting the use of overdispersed models like Gamma-Poisson or Beta-binomial mixtures as robust alternatives. Another deficiency of standard models is due to the fact that most words never occur in a given document, resulting in large amounts of zero counts. We propose using zeroinflated models for dealing with this, and evaluate competing models on a Naive Bayes text classification task. Simple zero-inflated models can account for practically relevant variation, and can be easier to work with than overdispersed models.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fitting of Count Time Series Models on the Number of Patients Referred to Addiction Treatment Centers in Semnan County

Abstract. Count data over time are observed in many application areas. Many researchers use time series patterns to analyze this data. In this paper, the poisson count time series linear models and negative binomials on this type of data with the explanatory variables are studied. The Likelihood analysis and the evaluation of count time series model based on generalized linear models are pres...

متن کامل

Parametric and nonparametric Bayesian model specification: A case study involving models for count data

In this paper we present the results of a simulation study to explore the ability of Bayesian parametric and nonparametric models to provide an adequate fit to count data, of the type that would routinely be analyzed parametrically either through fixed-effects or random-effects Poisson models. The context of the study is a randomized controlled trial with two groups (treatment and control). Our...

متن کامل

Hurdle, Inflated Poisson and Inflated Negative Binomial Regression Models ‎ for Analysis of Count Data with Extra Zeros

In this paper‎, ‎we ‎propose ‎Hurdle regression models for analysing count responses with extra zeros‎. A method of estimating maximum likelihood is used to estimate model parameters. The application of the proposed model is presented in insurance dataset‎. In this example‎, there are many numbers of claims equal to zero is considered that clarify the application of the model with a zero-inflat...

متن کامل

A Generalized Parametric Selection Model for Non-Normal Data

I develop a new approach for sample selection problems that allows parametric forms of any type to be chosen for both for the selection and the observed variables. The Generalized Parametric Selection (GPS) model can incorporate both duration and count data models, unlike previous parametric models. MLE does not require numerical integration or simulation techniques, unlike previous models for ...

متن کامل

Spatial count models on the number of unhealthy days in Tehran

Spatial count data is usually found in most sciences such as environmental science, meteorology, geology and medicine. Spatial generalized linear models based on poisson (poisson-lognormal spatial model) and binomial (binomial-logitnormal spatial model) distributions are often used to analyze discrete count data in which spatial correlation is observed. The likelihood function of these models i...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003